## library(devtools)
## library(BiocManager)
library(SingleCellExperiment)
library(ggplot2)
library(scFeatures) ## devtools::install_github("SydneyBioX/scFeatures")
library(ClassifyR) ## BiocManager::install("ClassifyR", dependencies = TRUE)
library(ggthemes)
library(spicyR) ## BiocManager::install("spicyR")
library(dplyr)
library(limma)
library(plotly)
library(scattermore)
library(tidyr)
library(survival)
library(survminer)
library(spatstat)
##library(spatstat.core) ## install.packages("spatstat")
##library(spatstat.geom) 
library(scater)
library(scran)

theme_set(theme_classic())

1 Overview

As single cell technology advances, the recent development of spatial omics allows us to examine the spatial organisation of cells within tissues in their native environment. This workshop will discuss the challenges and analytical focus associated with disease outcome prediction using multi-sample spatial datasets. We will also talk about general analytical strategies and the critical thinking questions that arise in the workflow.


Preparation and assumed knowledge

  • Knowledge of R syntax
  • Familiarity with the SingleCellExperiment class
  • Ability to install all required R packages, please check sessionInfo at the end of this document to ensure you are using the correct version.

Learning objectives

  • Understand and visualise spatial omics datasets
  • Explore various strategies for disease outcome prediction using spatial omics data
  • Understand and generate individual feature representations from a cell-level expression matrix
  • Develop appreciation on how to assess the performance of classification models
  • Perform disease outcome prediction using the feature representation and robust classification framework


Time outline

Structure for this 2-hour workshop:

Activity Time
Introduction to spatial technologies 20m
Cell segmentation with deep learning (with BIDCell) 20m
Exploring spatial data 20m
Break (Q&A) 10m
Extracting features from spatial data (with scFeatures) 30m
Performing disease outcome classification (with ClassifyR) 20m

2 Initial exploration and visualisation

Data and background

The widely-known METABRIC breast cancer cohort has recently had imaging mass cytometry generated for a subset of it. The publication describing this data is Imaging Mass Cytometry and Multiplatform Genomics Define the Phenogenomic Landscape of Breast Cancer, Nature Cancer, 2020. IMC has cell-level resolution. There are 483 cancer sample with IMC data. However, the subset of interest is those patients who do not have lymph node metastasis. Can their risk of recurrence accurately be predicted and therefore inform how aggressively they need to be treated? The other component of the analysis is patient clinical data, which has been sourced from Supplementary Table 5 of Dynamics of Breast-cancer Relapse Reveal Late-recurring ER-positive Genomic Subgroups, Nature, 2019.

Basic characteristics of the data objects:

  • The dataset contains 38 proteins and 76307 cells.
  • The outcome is recurrence-free survival.
load("~/data/breastCancer.RData")
data_sce <- IMC

print("data format")
## [1] "data format"
assay(data_sce)[1:5, 1:5]
##           MB-0002:345:93 MB-0002:345:107 MB-0002:345:113 MB-0002:345:114
## HH3_total     4.82032390       3.8229411      2.55845170      4.82208870
## CK19          1.21185160       1.3223530      0.13832258      0.33366668
## CK8_18        2.80274720       2.4720588      0.60545164      2.43351100
## Twist         0.14651649       0.1176471      0.09877419      0.20791112
## CD68          0.07863736       0.1016471      0.03225806      0.05237778
##           MB-0002:345:125
## HH3_total      2.41391110
## CK19           0.07055555
## CK8_18         0.20724444
## Twist          0.12004445
## CD68           0.00000000

The original data has been restricted to images with at least 400 cells, no lymph node cancer and Stage 1.

Aim

In this demo, we will fit survival models and evaluate the features selected to build them.

Exploration 1: How complex is my data?

At the start of the exploration, it is often good to get a sense of the complexity of the data. We usually do this by exploring the distribution of the outcomes and various variables in the patient’s meta-data. Here, we use cross-tabulation to examine the following variables:

  • Surgery vs death
  • ER status
  • Grade
print("Stage and death")
## [1] "Stage and death"
table(clinical$Breast.Surgery, clinical$Death, useNA = "ifany") 
##                    
##                      0  1
##   BREAST CONSERVING 38 14
##   MASTECTOMY        16  6
##   <NA>               2  1
print("Number of patients based on ER status")
## [1] "Number of patients based on ER status"
table(clinical$ER.Status)
## 
## neg pos 
##  12  64
print("Number of patients based on Grade")
## [1] "Number of patients based on Grade"
table(clinical$Grade)
## 
##  1  2  3 
## 14 30 28

Exploration 2: How to visualise my data?

Typically in single-cell data analysis, we perform dimension reduction to project the high dimensional cell by gene matrix on to 2D space. This allows us to visualise various things of interest, such as distribution of cell types and disease outcomes. In this dataset, cells were classified into 22 cell types based on their markers.

Note: for single-cell RNA-seq with around 20,000 genes, we often need to perform some filtering (e.g. select highly variable genes) to reduce the number of features. Here, given we have less than 50 proteins, there is no need to pre-filter. That being said, we provide some sample code below (commented out) to demonstrate how to identify highly variable genes followed by UMAP visualisation in scRNA-seq data.

# gene_var <- modelGeneVar(data_sce)
# hvgs <- getTopHVGs(gene_var, prop=0.1)
# data_sce <- runUMAP(data_sce, scale=TRUE,  subset_row = hvgs)
data_sce <- runUMAP(data_sce, scale=TRUE)
a <- plotUMAP(data_sce, colour_by = "description")
b <- plotUMAP(data_sce, colour_by = "metabricId")
a + b

Interactive Q&A:

What can we learn from these illustrations? Is there anything interesting in the plot? Questions to consider include:

  • Q1: Does each cell type cluster together?
  • Q2: When there is a large number of categories, are dimensionality reduction plots interpretable or misleading due to overplotting?

Exploration 3: Is there a spatial structure in my data?

The advantage with spatial omics is that we can examine the organisation of the cell types as it occurs on the tissue slide. Here, we visualise one of the slides from a patient. As an optional exercise, you can

  • permute the spatial coordinates

to give a sense of what is random ordering.

one_sample <- data_sce[, data_sce$metabricId  == "MB-0263"]
one_sample  <- colData(one_sample)
one_sample <- data.frame(one_sample)

tableau_palette <- scale_colour_tableau() 
color_codes <- tableau_palette$palette(10)

a <- ggplot(one_sample, aes(x = Location_Center_X , y = Location_Center_Y, colour = description)) + geom_point(alpha=0.7) +  scale_colour_manual(values = c(color_codes, "lightgrey")) + ggtitle("Original slide")
print("Optional: Permute the cell type labels")
## [1] "Optional: Permute the cell type labels"
one_sample$description_permute <- sample(one_sample$description)
b <- ggplot(one_sample, aes(x = Location_Center_X , y = Location_Center_Y, colour =description_permute)) + geom_point(alpha=0.7) +  scale_colour_manual(values = c(color_codes, "lightgrey")) + ggtitle("Permute the cell type label")

print("Optional: Permute the spatial coordinate")
## [1] "Optional: Permute the spatial coordinate"
one_sample$Location_Center_X_permute <- sample(one_sample$Location_Center_X)
one_sample$Location_Center_Y_permute <- sample(one_sample$Location_Center_Y)
c <- ggplot(one_sample, aes(x = Location_Center_X_permute , y = Location_Center_Y_permute, colour = description)) + geom_point(alpha=0.7) +  scale_colour_manual(values = c(color_codes, "lightgrey")) + ggtitle("Permute the X, Y coordinate")
a + b + c

Critical thinking:

  • Is there structure in the data real ?
  • What are additional strategies to generate a random distribution?

3 Describing tissue microenvrionments and cellular neighbourhoods

3.1 Do cell type co-localise in specfic regions?

We begin by examining how can we identify and visualise regions of tissue where spatial associations between cell-types are similar. There are many packages that perform this task andhere we use the lisaClust function that is based on “local L-function” to spatially cluster cells into different regions with similar cell type composition.

This has already been pre-loaded for you. Please proceed to the next task.

set.seed(51773)
BPPARAM <- MulticoreParam(16)

# Cluster cells into spatial regions with similar composition.
data_sce  <- lisaClust(
  data_sce ,
  k = 5,
  Rs = c(20, 50, 100),
  sigma = 50,
  spatialCoords = c("Location_Center_X", "Location_Center_Y"),
  cellType = "description",
  imageID = "ImageNumber" ,
  regionName = "region",
  BPPARAM = BPPARAM
)

3.2 Which regions appear to be different between young and old?

Visualise regions on individual level

metadata <- colData(data_sce)
metadata <- metadata[metadata$metabricId == "MB-0263",  ]
metadata <- data.frame(metadata)

plotlist <- list()
plotlist_celltype <- list()
thisregion  <-  unique(metadata$region)[1]

tableau_palette <- scale_colour_tableau() 
color_codes <- tableau_palette$palette(10)
color_codes <- c(color_codes, "darkgrey") 

names(color_codes) <- c(unique(metadata$description) ,  "other regions")

for ( thisregion in sort(unique(metadata$region))){
        
        selected_region_index <-  metadata$region == thisregion
        
        metadata$selected_region <-  "other regions"
        metadata$selected_region[selected_region_index] <- "selected region"
        
        metadata$celltype <- metadata$description
        metadata$celltype[!selected_region_index ] <-   "other regions"
        
        metadata$celltype <- factor(metadata$celltype, levels = c(unique(metadata$description), "other regions"))

       p <- ggplot(metadata, aes(x = Location_Center_X , y = Location_Center_Y , colour = selected_region  )) + geom_scattermore(pointsize = 1.5) + ggtitle(thisregion) + scale_colour_manual(values = c("grey" , "red"))
         
       
    
       p2 <-  ggplot(metadata, aes(x = Location_Center_X , y = Location_Center_Y , colour =  celltype )) + geom_scattermore(pointsize = 1.5) + ggtitle(thisregion) + scale_colour_manual(values =  color_codes)
       
      plotlist [[thisregion ]] <- p
       
      plotlist_celltype [[thisregion ]] <- p2
}

ggarrange(plotlist = plotlist , ncol = 5, nrow = 1 , common.legend = T )

ggarrange(plotlist = plotlist_celltype , ncol = 5, nrow = 1 , common.legend = T )

Visualise regions across patients

We can better interpret the region output by summarising the proportion of each cell type in a region across the patients. Looking at the composition of cell types in each region, comparing between Young and Old Patients. A cutoff of 50 years old will be used.

df <- data.frame(colData( data_sce))
clinical$AgeGroup <- ifelse(clinical$Age.At.Diagnosis < 50, "Young", "Old")

df_plot <- NULL
for ( thispatient in unique(df$metabricId)){
  this_df <- df[df$metabricId == thispatient, ]
  temp_df <-   table( this_df$region, this_df$description )
  temp_df <-  temp_df / rowSums(temp_df)
   temp_df <- data.frame(  temp_df)
  temp_df$patient <-  thispatient
  temp_df$group <- clinical[clinical$metabricId == thispatient, "AgeGroup"]
  df_plot <- rbind(df_plot, temp_df)
}

df_plot <- df_plot %>% dplyr::group_by( Var1 , Var2, group) %>% 
  summarise(mean_proportion = mean(Freq))
  
# r1 <- df_plot[ df_plot$Var1 == "region_1", ]  

ggplot(df_plot, aes(y = Var2, x = Var1 ,colour =mean_proportion  , size = mean_proportion ))+  geom_point() + 
  facet_grid(~group, scales = "free", space = "free" ) + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) + 
  xlab("Region" ) + ylab("Celltype") + scale_colour_viridis_c()

Interactive Q&A:

Q4: Which regions appear to be different between Young and Old?

df <- clinical[colData(data_sce)[, "metabricId"], ]
df$region <- data_sce$region
df$description <- data_sce$description

df <- df %>% dplyr::group_by(metabricId , AgeGroup, region) %>%
  count(description) %>%
  mutate(proportion = n / sum(n))


ggplot(df, aes(y = proportion, x = metabricId , fill = description))+ geom_col()+facet_grid(~region+AgeGroup, scales = "free", space = "free" ) + scale_fill_manual(values = c(color_codes))  +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

Interactive Q&A:

Q5: Does your conclusion change after looking at a different plot?

3.3 Further exploration by visualising selected regions

We see that region 1 appears to suggest the non-responder patients have more melano. Region 3 appears to be the tumor microenvironment with lots of Th.ae (antigen-experienced) and macro.mono (macrophage and monocytes) cell types. Let’s focus on region 1 and region 3 and visualise the data with boxplots, as well as comparing to the overall cell type proportion without segmenting into regions.

df_plot <- NULL
for ( thispatient in unique(df$metabricId)){
  this_df <- df[df$metabricId == thispatient, ]
  temp_df <-   table( this_df$region, this_df$description )
  temp_df <-  temp_df / rowSums(temp_df)
  temp_df <- data.frame(  temp_df)
  temp_df$patient <-  thispatient
  temp_df$group <- unique( this_df$AgeGroup)
  df_plot <- rbind(df_plot, temp_df)
}

df_plot_region_1 <- df_plot[df_plot$Var1 == "region_1", ]
 
a <- ggplot(df_plot_region_1, aes(x =  Var2,  y = Freq, colour = group)) +
  geom_boxplot()+ 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) + 
  ylab("Proportion") + xlab("Cell type")+ ggtitle("Region 1") + ylim(0, 0.25)


df_plot_region_3 <- df_plot[df_plot$Var1 == "region_3", ]

b <- ggplot(df_plot_region_3, aes(x =  Var2,  y = Freq, colour = group)) +
  geom_boxplot()+ 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))  + 
  ylab("Proportion") + xlab("Cell type") + ggtitle("Region 3") + ylim(0, 0.25)
 


overall <- NULL
for ( thispatient in unique(df$metabricId)){
  
  this_df <- df[df$metabricId == thispatient, ]
  
  temp_df <-   table(  this_df$description )
  temp_df <-  temp_df /sum(temp_df)
   temp_df <- data.frame(  temp_df)
  temp_df$patient <-  thispatient
  temp_df$group <- unique( this_df$AgeGroup )
  overall <- rbind(overall, temp_df)
}


c <- ggplot(overall, aes(x =  Var1,  y = Freq, colour = group)) +
  geom_boxplot()+ 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))  + 
  ylab("Proportion") + xlab("Cell type") + ggtitle("Overall composition")
 

a + b + c

Selecting a specific marker for further exploration

Often you may have a marker in mind to further examine the expression of key marker genes in these region specific cell types. For example, we select cells that have high Ki67 expression. (ie, only keeping the cells that have Ki67 expression higher than the median Ki67 expression in the whole dataset). We choose Ki67 as an example here because Ki67 is strongly associated with tumor cell proliferation and growth and is widely used as a biomarker in cancer analysis.

median_ki67 <- median( logcounts(data_sce[ "Ki67" , ]))
data_sce$ki67 <- ifelse( logcounts(data_sce[ "Ki67" , ]) > median_ki67, "high_ki67" , "low_ki67")


df_plot <- NULL
for ( thispatient in unique(df$metabricId)){
  this_df <- df[df$metabricId == thispatient, ]
  temp_df <-   table( this_df$region, this_df$description )
  temp_df <-  temp_df / rowSums(temp_df)
  temp_df <- data.frame(  temp_df)
  temp_df$patient <-  thispatient
  temp_df$group <- unique( this_df$AgeGroup )
  df_plot <- rbind(df_plot, temp_df)
}

df_plot_region_1 <- df_plot[df_plot$Var1 == "region_1", ]
 
a <- ggplot(df_plot_region_1, aes(x =  Var2,  y = Freq, colour = group)) +
  geom_boxplot()+ 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) + 
  ylab("Proportion") + xlab("Cell type")+ ggtitle("Region 1") + ylim(0, 0.25)


df_plot_region_3 <- df_plot[df_plot$Var1 == "region_3", ]

b <- ggplot(df_plot_region_3, aes(x =  Var2,  y = Freq, colour = group)) +
  geom_boxplot()+ 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))  + 
  ylab("Proportion") + xlab("Cell type") + ggtitle("Region 3") + ylim(0, 0.25)
 


overall <- NULL
for ( thispatient in unique(df$metabricId)){
  
  this_df <- df[df$metabricId == thispatient, ]
  
  temp_df <-   table(  this_df$description )
  temp_df <-  temp_df /sum(temp_df)
   temp_df <- data.frame(  temp_df)
  temp_df$patient <-  thispatient
  temp_df$group <- unique( this_df$AgeGroup )
  overall <- rbind(overall, temp_df)
}


c <- ggplot(overall, aes(x =  Var1,  y = Freq, colour = group)) +
  geom_boxplot()+ 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))  + 
  ylab("Proportion") + xlab("Cell type") + ggtitle("Overall composition") + ylim(0, 0.25)
 

a + b + c

Discussion:

Comparing the overall composition and the cell type composition in the region, what did you observe about the regions?

4 How do we generate a molecular representation for each individual?

In this demo, we use scFeatures to generate molecular representation for each patient from the matrix of proteins by cells. The molecular representation is interpretable and hence facilitates downstream analysis of the patient. Overall, scFeatures generates features across six categories representing different molecular views of cellular characteristics. These include:

  1. cell type proportions
  2. cell type specific gene expressions
  3. cell type specific pathway expressions
  4. cell type specific cell-cell interaction (CCI) scores
  5. overall aggregated gene expressions
  6. spatial metrics

The different types of features constructed enable a more comprehensive multi-view understanding of each patient from a matrix of proteins x cells.

Given in the previous section, we clustered the cells into regions, we can use the region information as an additional layer of information on top of the cell types to examine region-specific cell-type specific features.

region <- data_sce$region
region <- gsub("_" , "", region)
data_sce$celltype <- paste0( data_sce$description , "-" , region)
print("number of cells in each sample")
## [1] "number of cells in each sample"
table(data_sce$metabricId) 
## 
## MB-0002 MB-0064 MB-0128 MB-0130 MB-0132 MB-0136 MB-0138 MB-0142 MB-0145 MB-0154 
##    1111    1284    1005     870    1115    1648     509     744     982     338 
## MB-0168 MB-0175 MB-0180 MB-0192 MB-0201 MB-0208 MB-0225 MB-0227 MB-0232 MB-0240 
##     819    1002    1247     808     580     625    1036    1357     847    1026 
## MB-0245 MB-0246 MB-0249 MB-0252 MB-0255 MB-0256 MB-0258 MB-0260 MB-0263 MB-0275 
##    1282     859     474     794     932     801     499    1016     755     418 
## MB-0308 MB-0318 MB-0320 MB-0347 MB-0351 MB-0367 MB-0377 MB-0383 MB-0394 MB-0397 
##    1116    1380     847     789     803     552    1026     773    1201    1275 
## MB-0400 MB-0401 MB-0405 MB-0410 MB-0420 MB-0439 MB-0451 MB-0454 MB-0455 MB-0461 
##    1138     923     819    1138     532    1307     913    1195     605    1630 
## MB-0463 MB-0467 MB-0470 MB-0475 MB-0480 MB-0481 MB-0487 MB-0498 MB-0508 MB-0516 
##    1592     886     865    1742     593    1685     967    1338    1226    1086 
## MB-0518 MB-0520 MB-0531 MB-0536 MB-0571 MB-0584 MB-0591 MB-0596 MB-0604 MB-0605 
##     662    1393     607    1023     681    1078     940    1679    1182     634 
## MB-0628 MB-0635 MB-0636 MB-0642 MB-0661 MB-0662 MB-0663 
##    1730     836     524    1255    1553    1238     567
print("number of cells in each celltype - Region specific cell type")
## [1] "number of cells in each celltype - Region specific cell type"
table(data_sce$celltype) 
## 
##                  B cells-region1                  B cells-region2 
##                                5                              111 
##                  B cells-region3                  B cells-region4 
##                               11                              157 
##                  B cells-region5              Basal CKlow-region1 
##                               13                              125 
##              Basal CKlow-region2              Basal CKlow-region3 
##                              132                              104 
##              Basal CKlow-region4              Basal CKlow-region5 
##                              574                               28 
##              Endothelial-region1              Endothelial-region2 
##                               20                              253 
##              Endothelial-region3              Endothelial-region4 
##                               24                              238 
##              Endothelial-region5        Fibroblasts CD68+-region1 
##                               54                              146 
##        Fibroblasts CD68+-region2        Fibroblasts CD68+-region3 
##                              829                               42 
##        Fibroblasts CD68+-region4        Fibroblasts CD68+-region5 
##                              843                               80 
##              Fibroblasts-region1              Fibroblasts-region2 
##                              842                             3191 
##              Fibroblasts-region3              Fibroblasts-region4 
##                              296                             3894 
##              Fibroblasts-region5                    HER2+-region1 
##                              400                              149 
##                    HER2+-region2                    HER2+-region3 
##                               97                               26 
##                    HER2+-region4                    HER2+-region5 
##                              230                               42 
##                 HR- CK7--region1                 HR- CK7--region2 
##                             1924                              602 
##                 HR- CK7--region3                 HR- CK7--region4 
##                              667                             2557 
##                 HR- CK7--region5                 HR- CK7+-region1 
##                              824                              506 
##                 HR- CK7+-region2                 HR- CK7+-region3 
##                              443                             2823 
##                 HR- CK7+-region4                 HR- CK7+-region5 
##                             2635                              863 
##           HR- CKlow CK5+-region1           HR- CKlow CK5+-region2 
##                               50                              128 
##           HR- CKlow CK5+-region3           HR- CKlow CK5+-region4 
##                                6                              155 
##           HR- CKlow CK5+-region5                HR- Ki67+-region1 
##                                7                              117 
##                HR- Ki67+-region2                HR- Ki67+-region3 
##                              148                               61 
##                HR- Ki67+-region4                HR- Ki67+-region5 
##                              955                               30 
##           HR+ CK7- Ki67+-region1           HR+ CK7- Ki67+-region2 
##                              293                              158 
##           HR+ CK7- Ki67+-region3           HR+ CK7- Ki67+-region4 
##                              280                             1128 
##           HR+ CK7- Ki67+-region5           HR+ CK7- Slug+-region1 
##                              453                               58 
##           HR+ CK7- Slug+-region2           HR+ CK7- Slug+-region3 
##                               39                               28 
##           HR+ CK7- Slug+-region4           HR+ CK7- Slug+-region5 
##                              789                              147 
##                 HR+ CK7--region1                 HR+ CK7--region2 
##                             1652                             1719 
##                 HR+ CK7--region3                 HR+ CK7--region4 
##                             1775                             9299 
##                 HR+ CK7--region5              HRlow CKlow-region1 
##                             8606                             1742 
##              HRlow CKlow-region2              HRlow CKlow-region3 
##                              353                               86 
##              HRlow CKlow-region4              HRlow CKlow-region5 
##                              791                              271 
##                  Hypoxia-region1                  Hypoxia-region2 
##                               77                              165 
##                  Hypoxia-region3                  Hypoxia-region4 
##                               64                              616 
##                  Hypoxia-region5 Macrophages Vim+ CD45low-region1 
##                               23                               17 
## Macrophages Vim+ CD45low-region2 Macrophages Vim+ CD45low-region3 
##                              230                               19 
## Macrophages Vim+ CD45low-region4 Macrophages Vim+ CD45low-region5 
##                              275                               38 
##   Macrophages Vim+ Slug--region1   Macrophages Vim+ Slug--region2 
##                               15                              200 
##   Macrophages Vim+ Slug--region3   Macrophages Vim+ Slug--region4 
##                               16                              284 
##   Macrophages Vim+ Slug--region5   Macrophages Vim+ Slug+-region1 
##                               26                               49 
##   Macrophages Vim+ Slug+-region2   Macrophages Vim+ Slug+-region3 
##                              140                               37 
##   Macrophages Vim+ Slug+-region4   Macrophages Vim+ Slug+-region5 
##                              372                               88 
##            Myoepithelial-region1            Myoepithelial-region2 
##                               18                              262 
##            Myoepithelial-region3            Myoepithelial-region4 
##                              261                              700 
##            Myoepithelial-region5           Myofibroblasts-region1 
##                              150                              714 
##           Myofibroblasts-region2           Myofibroblasts-region3 
##                             4035                              510 
##           Myofibroblasts-region4           Myofibroblasts-region5 
##                             4774                              722 
##                  T cells-region1                  T cells-region2 
##                               72                             1161 
##                  T cells-region3                  T cells-region4 
##                              117                             1138 
##                  T cells-region5            Vascular SMA+-region1 
##                               99                               15 
##            Vascular SMA+-region2            Vascular SMA+-region3 
##                              256                               27 
##            Vascular SMA+-region4            Vascular SMA+-region5 
##                              363                               38

Discussion:

Are there any samples or cell types you would like to remove from the data?

How to create molecular representations of patients

All the feature types can be generated in one line of code. This runs the function using default settings for all parameters. For more information, type ?scFeatures. To save time, just load the prepared RDS file in the following chunk.

# scFeatures requires the following columns 
# celltype, sample, x_cord and y_cord
# alternatively, these can be also set as argument in the scFeatures function 
 
IMCmatrix <- assay(data_sce)
sample = data_sce$metabricId
celltype = data_sce$celltype
spatialCoords <- list(colData(data_sce)$Location_Center_X, colData(data_sce)$Location_Center_Y)

# here, we specify that this is a spatial proteomics data
# scFeatures support parallel computation to speed up the process 
scfeatures_result <- scFeatures(IMCmatrix, type = "spatial_p", sample = sample, celltype = celltype, spatialCoords = spatialCoords,
                                ncores = 32)
scfeatures_result <- readRDS("~/data/scfeatures_result.RDS")

4.1 How to explore scFeatures output?

We have generated a total of 13 feature types and stored them in a list. All generated feature types are stored in a matrix of samples by features. For example, the first list element contains the feature type proportion_raw, which contains the cell type proportion features for each patient sample. We could print out the first 5 columns and first 5 rows of the first element to see.

# we have generated a total of 13 feature types
names(scfeatures_result)
##  [1] "proportion_raw"       "proportion_logit"     "proportion_ratio"    
##  [4] "gene_mean_celltype"   "gene_prop_celltype"   "gene_cor_celltype"   
##  [7] "gene_mean_bulk"       "gene_prop_bulk"       "gene_cor_bulk"       
## [10] "L_stats"              "celltype_interaction" "morans_I"            
## [13] "nn_correlation"
# each row is a sample, each column is a feature 
data.frame(scfeatures_result[[1]][1:5, 1:5])
##         B.cells.region1 B.cells.region2 B.cells.region3 B.cells.region4
## MB-0002    0.0000000000     0.000000000               0      0.00000000
## MB-0064    0.0000000000     0.000000000               0      0.00000000
## MB-0128    0.0009950249     0.044776119               0      0.06666667
## MB-0130    0.0000000000     0.000000000               0      0.00000000
## MB-0132    0.0000000000     0.007174888               0      0.00000000
##         B.cells.region5
## MB-0002     0.000000000
## MB-0064     0.000000000
## MB-0128     0.007960199
## MB-0130     0.000000000
## MB-0132     0.000000000
## DT::datatable(meta_table , options = list(scrollX = TRUE))

5 Can we predict recurrence risk?

Building a classification model

In this section, we will perform disease outcome classification using the molecular representation of patients. Recall in the previous section that we have stored the 13 feature types matrix in a list. Instead of manually retrieving each matrix from the list to build separate models, classifyR can directly take a list of matrices as an input and run a repeated cross-validation model on each matrix individually. Below, we run 5 repeats of 5-fold cross-validation. The patient outcome is time-to-event, so, by default, ClassifyR will use Cox proportional hazards ranking to choose a set of features and also Cox proportional hazards to predict risk scores. Other models are available. A high score indicates prection of a worse outcome than a lower risk score.

usefulFeatures <- c("Breast.Tumour.Laterality", "ER.Status", "Inferred.Menopausal.State", "Grade", "Size")
nFeatures <- append(list(clinical = 1:3), lapply(scfeatures_result, function(metaFeature) 1:5))
clinicalAndOmics <- append(list(clinical = clinical), scfeatures_result)

### generate classfyr result 

classifyr_result <- crossValidate(clinicalAndOmics, c("timeRFS", "eventRFS"),
                    extraParams = list(prepare = list(useFeatures = list(clinical = usefulFeatures))),
                    nFeatures = nFeatures, nFolds = 5, nRepeats = 5, nCores = 5)

Cox proportional hazards is a traditional statistical method. Compare the predictive performance with a machine learning method. Random survival forest will be used. Such models can build remarkably complex relationships between features. Note, however, the running time is much longer than Cox proportional hazards and feature selection is used to limit the number of features considered to 100 at most per metafeature. To save time, just load the prepared RDS file.

nFeatures <- append(list(clinical = 1:3), lapply(scfeatures_result[2:length(scfeatures_result)], function(metaFeature) min(100, ncol(metaFeature))))
survForestCV <- crossValidate(clinicalAndOmics, outcome, nFeatures = nFeatures,
                classifier = "randomForest",
                nFolds = 5, nRepeats = 5, nCores = 5)
survForestCV <- readRDS("~/data/survForestCV.RDS")

Visualising the classification performance of individual metafeatures

To examine the distribution of prediction performance. use performancePlot. Currently, the only metric for time-to-event data is C-index and that will automatically be used because the predictive model type is tracked inside of the result objects.

tilt <- theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
performancePlot(append(classifyr_result, survForestCV),
                characteristicsList = list(x = "Assay Name", row = "Classifier Name"),
                orderingList = list("Assay Name" = names(scfeatures_result))) + tilt

Note how the resultant plot is a ggplot2 object and can be further modified.

The same code could be used for a categorical classifier because the random forest implementation provided by ranger package has the same interface for both.

Next, examine feature selection stability with selectionPlot.

selectionPlot(append(classifyr_result, survForestCV),
                characteristicsList = list(x = "Assay Name", row = "Classifier Name"),
                orderingList = list("Assay Name" = names(scfeatures_result))) + tilt

distribution(classifyr_result[[1]], plot = FALSE)
##      assay                   feature proportion
## 1 clinical Inferred.Menopausal.State          1

Using samplesMetricMap compare the per-sample C-index for Cox and Random Forest models for metafeature gene_cor_celltype.

library(grid)
heatmap <- samplesMetricMap(list(classifyr_result[[7]], survForestCV[[7]]))
grid.draw(heatmap)

A few samples are predicted better by one model than another.

The per-sample C-index is a metric unique to ClassifyR. Models and feature selection approaches may be seen in the vignette or listed by available().

Interactive Q&A:

Q8: Is highest predictive performance the only way to choose the best model or are other models the best for other reasons?

5.1 PART A: Explanation of spatial features

  • L function:

The L function is a spatial statistic used to assess the spatial distribution of cell types. It assesses the significance of cell-cell interactions, by calculating the density of a cell type with other cell types within a certain radius. High values indicate spatial association, low values indicate spatial avoidance.

one_sample  <- data_sce[ , data_sce$metabricId == "MB-0128"  ]
one_sample <- data.frame( colData(one_sample) )

one_sample$celltype <- one_sample$description
index <-  one_sample$celltype  %in% c("B cells", "Fibroblasts")
one_sample$celltype[!index] <- "others"
a <-ggplot( one_sample, aes(x = Location_Center_X , y = Location_Center_Y, colour = celltype )) + geom_point()  + scale_colour_manual(values = color_codes)  + ggtitle( "Patient MB-0128 - high L value with \n B cells interacting Fibroblasts")
 

one_sample$celltype <- one_sample$description
index <-  one_sample$celltype  %in% c("melano", "Tc.ae")
one_sample$celltype[!index] <- "others"
b <- ggplot( one_sample, aes(x = Location_Center_X , y = Location_Center_Y, colour = celltype )) + geom_point()  + scale_colour_manual(values = color_codes)  + ggtitle( "Patient MB-0128 - low L value with \n B cells interacting HR_ CK7")
 
a + b

  • Cell type interaction composition:

We calculate the nearest neighbours of each cell and then calculate the pairs of cell types based on the nearest neighbour. This allows us to summarise it into a cell type interaction composition.

one_sample  <- data_sce[ , data_sce$metabricId == "MB-0263"  ]
one_sample <- data.frame( colData(one_sample) )

one_sample$celltype <- one_sample$description
 
a <-ggplot( one_sample, aes(x = Location_Center_X , y = Location_Center_Y, colour = celltype )) + geom_point()  + scale_colour_manual(values = color_codes)  + ggtitle( "Patient MB-0263")


feature  <- scfeatures_result$celltype_interaction
to_plot <- data.frame( t( feature["MB-0263", ])  )
to_plot$feature <- rownames(to_plot) 
colnames(to_plot)[2] <- "celltype interaction composition"
 
to_plot <- to_plot[ to_plot$MB.0263 > 0.03 , ] 
b <- ggplot(to_plot, aes(x = reorder(`celltype interaction composition`, MB.0263) ,  y = MB.0263, fill=`celltype interaction composition`)) + geom_bar(stat="identity" ) + ylab("Major cell type interactions")  +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) 

a+ b

  • Moran’s I:

Moran’s I is a spatial autocorrelation statistic based on both location and values. It quantifies whether similar values tend to occur near each other or are dispersed.

high  <- data_sce["Ki67", data_sce$metabricId == "MB-0132"  ]
high_meta <- data.frame( colData(high) ) 
high_meta$expression <- as.vector(logcounts( high)) 

low  <- data_sce["Ki67",  data_sce$metabricId == "MB-0249" ]
low_meta <- data.frame( colData(low) )
low_meta$expression <- as.vector(logcounts(low))


a <- ggplot(high_meta, aes(x = Location_Center_X , y = Location_Center_Y, colour =expression)) + geom_point(alpha=0.5) + scale_colour_viridis_c() + ggtitle("Patient MB-0132 - high Moran's I in Ki67")

b <- ggplot(low_meta, aes(x = Location_Center_X , y = Location_Center_Y, colour =expression)) + geom_point(alpha=0.5) + scale_colour_viridis_c() + ggtitle("Patient MB-0249 - low Moran's I in Ki67")

a+b

  • Nearest Neighbor Correlation:

This metric measures the correlation of proteins/genes between a cell and its nearest neighbour cell.

 plot_nncorrelation <- function(thissample , thisprotein){
   
       sample_name <- thissample
       thissample <- data_sce[, data_sce$metabricId ==     sample_name]
    
      
      exprsMat <- logcounts(thissample)
     
    
    cell_points_cts <- spatstat.geom::ppp(
            x = as.numeric(thissample$Location_Center_X ), y = as.numeric(thissample$Location_Center_Y),
            check = FALSE,
            xrange = c(
                min(as.numeric(thissample$Location_Center_X)),
                max(as.numeric(thissample$Location_Center_X))
            ),
            yrange = c(
                min(as.numeric(thissample$Location_Center_Y)),
                max(as.numeric(thissample$Location_Center_Y))
            ),
            marks = t(as.matrix(exprsMat))
        )
    
     value <-  spatstat.explore::nncorr(cell_points_cts)["correlation", ]
      value <-  value[  thisprotein]
     
    # Find the indices of the two nearest neighbors for each cell
    nn_indices <- nnwhich(cell_points_cts, k = 1)
    
    protein <-  thisprotein
    df <- data.frame(thiscell_exprs  = exprsMat[protein, ] , exprs =  exprsMat[protein,nn_indices ])
    
   p <-  ggplot(df, aes( x =thiscell_exprs ,  y = exprs , colour =  exprs  )) +
      geom_point(alpha = 0.3) + ggtitle(paste0( "Patient ", sample_name ,  " nn_corr = " ,  round(value, 2)  )) + scale_colour_viridis_c() 
   
   return (p ) 

}

    
p1 <- plot_nncorrelation("MB-0605",  "HER2")
p2 <- plot_nncorrelation("MB-0258",  "HER2")
p1  + p2

The correlation differs between the 42RD patient (from cluster 1) and the 29RD patient (from cluster 2).

5.2 PART C: SessionInfo

sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 12 (bookworm)
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.21.so;  LAPACK version 3.11.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: Australia/Sydney
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] grid      stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] scran_1.28.2                scater_1.28.0              
##  [3] scuttle_1.10.2              spatstat_3.0-6             
##  [5] spatstat.linnet_3.1-1       spatstat.model_3.2-6       
##  [7] rpart_4.1.19                spatstat.explore_3.2-3     
##  [9] nlme_3.1-162                spatstat.random_3.1-6      
## [11] spatstat.geom_3.2-5         spatstat.data_3.0-1        
## [13] survminer_0.4.9             ggpubr_0.6.0               
## [15] tidyr_1.3.0                 scattermore_1.2            
## [17] plotly_4.10.2               limma_3.56.2               
## [19] dplyr_1.1.3                 spicyR_1.12.2              
## [21] ggthemes_4.2.4              ClassifyR_3.5.21           
## [23] survival_3.5-3              BiocParallel_1.34.2        
## [25] MultiAssayExperiment_1.26.0 generics_0.1.3             
## [27] scFeatures_0.99.27          ggplot2_3.4.3              
## [29] SingleCellExperiment_1.22.0 SummarizedExperiment_1.30.2
## [31] Biobase_2.60.0              GenomicRanges_1.52.1       
## [33] GenomeInfoDb_1.36.4         IRanges_2.34.1             
## [35] S4Vectors_0.38.2            BiocGenerics_0.46.0        
## [37] MatrixGenerics_1.12.3       matrixStats_1.0.0          
## 
## loaded via a namespace (and not attached):
##   [1] SpatialExperiment_1.10.0   R.methodsS3_1.8.2         
##   [3] GSEABase_1.62.0            progress_1.2.2            
##   [5] EnsDb.Mmusculus.v79_2.99.0 goftest_1.2-3             
##   [7] Biostrings_2.68.1          HDF5Array_1.28.1          
##   [9] vctrs_0.6.3                digest_0.6.33             
##  [11] png_0.1-8                  shape_1.4.6               
##  [13] EnsDb.Hsapiens.v79_2.99.0  ggrepel_0.9.3             
##  [15] deldir_1.0-9               parallelly_1.36.0         
##  [17] magick_2.8.0               MASS_7.3-58.3             
##  [19] reshape2_1.4.4             httpuv_1.6.11             
##  [21] foreach_1.5.2              withr_2.5.1               
##  [23] xfun_0.40                  ellipsis_0.3.2            
##  [25] memoise_2.0.1              proxyC_0.3.3              
##  [27] ggbeeswarm_0.7.2           zoo_1.8-12                
##  [29] GlobalOptions_0.1.2        gtools_3.9.4              
##  [31] SingleCellSignalR_1.12.0   pbapply_1.7-2             
##  [33] R.oo_1.25.0                prettyunits_1.2.0         
##  [35] KEGGREST_1.40.1            promises_1.2.1            
##  [37] httr_1.4.7                 rstatix_0.7.2             
##  [39] restfulr_0.0.15            globals_0.16.2            
##  [41] fitdistrplus_1.1-11        rhdf5filters_1.12.1       
##  [43] rhdf5_2.44.0               rstudioapi_0.15.0         
##  [45] miniUI_0.1.1.1             concaveman_1.1.0          
##  [47] babelgene_22.9             curl_5.1.0                
##  [49] zlibbioc_1.46.0            ScaledMatrix_1.8.1        
##  [51] polyclip_1.10-6            GenomeInfoDbData_1.2.10   
##  [53] xtable_1.8-4               stringr_1.5.0             
##  [55] evaluate_0.22              S4Arrays_1.0.6            
##  [57] BiocFileCache_2.8.0        hms_1.1.3                 
##  [59] irlba_2.3.5.1              colorspace_2.1-0          
##  [61] filelock_1.0.2             ROCR_1.0-11               
##  [63] reticulate_1.32.0          magrittr_2.0.3            
##  [65] lmtest_0.9-40              viridis_0.6.4             
##  [67] later_1.3.1                lattice_0.20-45           
##  [69] future.apply_1.11.0        XML_3.99-0.14             
##  [71] cowplot_1.1.1              ggupset_0.3.0             
##  [73] RcppAnnoy_0.0.21           pillar_1.9.0              
##  [75] iterators_1.0.14           caTools_1.18.2            
##  [77] compiler_4.3.1             beachmat_2.16.0           
##  [79] stringi_1.7.12             tensor_1.5                
##  [81] minqa_1.2.5                GenomicAlignments_1.36.0  
##  [83] plyr_1.8.9                 msigdbr_7.5.1             
##  [85] crayon_1.5.2               abind_1.4-5               
##  [87] BiocIO_1.10.0              locfit_1.5-9.8            
##  [89] sp_2.1-0                   bit_4.0.5                 
##  [91] codetools_0.2-19           BiocSingular_1.16.0       
##  [93] bslib_0.5.1                multtest_2.56.0           
##  [95] mime_0.12                  splines_4.3.1             
##  [97] circlize_0.4.15            Rcpp_1.0.11               
##  [99] dbplyr_2.3.4               sparseMatrixStats_1.12.2  
## [101] knitr_1.44                 blob_1.2.4                
## [103] utf8_1.2.3                 AnnotationFilter_1.24.0   
## [105] lme4_1.1-34                listenv_0.9.0             
## [107] DelayedMatrixStats_1.22.6  GSVA_1.48.3               
## [109] ggsignif_0.6.4             tibble_3.2.1              
## [111] Matrix_1.6-1               scam_1.2-14               
## [113] statmod_1.5.0              tweenr_2.0.2              
## [115] pkgconfig_2.0.3            pheatmap_1.0.12           
## [117] tools_4.3.1                cachem_1.0.8              
## [119] RSQLite_2.3.1              viridisLite_0.4.2         
## [121] DBI_1.1.3                  numDeriv_2016.8-1.1       
## [123] fastmap_1.1.1              rmarkdown_2.25            
## [125] scales_1.2.1               ica_1.0-3                 
## [127] Seurat_4.4.0               Rsamtools_2.16.0          
## [129] broom_1.0.5                sass_0.4.7                
## [131] patchwork_1.1.3            BiocManager_1.30.22       
## [133] graph_1.78.0               carData_3.0-5             
## [135] RANN_2.6.1                 farver_2.1.1              
## [137] mgcv_1.8-42                yaml_2.3.7                
## [139] rtracklayer_1.60.1         cli_3.6.1                 
## [141] purrr_1.0.2                leiden_0.4.3              
## [143] lifecycle_1.0.3            uwot_0.1.16               
## [145] bluster_1.10.0             backports_1.4.1           
## [147] DropletUtils_1.20.0        annotate_1.78.0           
## [149] gtable_0.3.4               rjson_0.2.21              
## [151] ggridges_0.5.4             progressr_0.14.0          
## [153] parallel_4.3.1             ape_5.7-1                 
## [155] jsonlite_1.8.7             edgeR_3.42.4              
## [157] bitops_1.0-7               bit64_4.0.5               
## [159] Rtsne_0.16                 spatstat.utils_3.0-3      
## [161] BiocNeighbors_1.18.0       SeuratObject_4.1.4        
## [163] RcppParallel_5.1.7         jquerylib_0.1.4           
## [165] metapod_1.8.0              dqrng_0.3.1               
## [167] survMisc_0.5.6             R.utils_2.12.2            
## [169] lazyeval_0.2.2             shiny_1.7.5               
## [171] htmltools_0.5.6.1          KMsurv_0.1-5              
## [173] sctransform_0.4.0          rappdirs_0.3.3            
## [175] ensembldb_2.24.1           glue_1.6.2                
## [177] XVector_0.40.0             RCurl_1.98-1.12           
## [179] gridExtra_2.3              AUCell_1.22.0             
## [181] boot_1.3-28.1              igraph_1.5.1              
## [183] R6_2.5.1                   gplots_3.1.3              
## [185] labeling_0.4.3             km.ci_0.5-6               
## [187] GenomicFeatures_1.52.2     cluster_2.1.4             
## [189] Rhdf5lib_1.22.1            nloptr_2.0.3              
## [191] vipor_0.4.5                DelayedArray_0.26.7       
## [193] tidyselect_1.2.0           ProtGenerics_1.32.0       
## [195] ggforce_0.4.1              xml2_1.3.5                
## [197] car_3.1-2                  AnnotationDbi_1.62.2      
## [199] future_1.33.0              rsvd_1.0.5                
## [201] munsell_0.5.0              KernSmooth_2.23-20        
## [203] data.table_1.14.8          htmlwidgets_1.6.2         
## [205] RColorBrewer_1.1-3         biomaRt_2.56.1            
## [207] rlang_1.1.1                spatstat.sparse_3.0-2     
## [209] lmerTest_3.1-3             fansi_1.0.5               
## [211] beeswarm_0.4.0

5.3 Acknowledgment

The authors thank all their colleagues, particularly at The University of Sydney, Sydney Precision Data Science and Charles Perkins Centre for their support and intellectual engagement. Special thanks to Ellis Patrick, Shila Ghazanfar, Andy Tran for guiding and supporting the building of this workshop.